VARIANCE ESTIMATION FOR NEAREST NEIGHBOR
IMPUTATION FOR US CENSUS LONG FORM DATA

By Jae Kwang Kim, Wayne A. Fuller and William R. Bell 

Iowa State University, Iowa State University and US Bureau of Census 

Variance estimation for estimators of state, county, and school
district quantities derived from the Census 2000 long form is
discussed. The variance estimator must account for (1) uncertainty due
to imputation, and (2) raking to census population controls. An impu- 
tation procedure that imputes more than one value for each missing 
item using donors that are neighbors is described and the procedure 
using two nearest neighbors is applied to the Census long form. The 
Kim and Fuller [Biometrika 91 (2004) 559-578] method for variance 
estimation under fractional hot deck imputation is adapted for appli- 
cation to the long form data. Numerical results from the 2000 long 
form data are presented. 

1. Introduction. In Census 2000, income data were collected on the long
form, which was distributed to about one of every six households in the United
States. These data were used to produce various income and poverty esti- 
mates for the US, and for states, counties, and other small areas. The state 
and county income and poverty estimates from the Census 2000 long form 
sample have been used in various ways by the Census Bureau's Small Area 
Income and Poverty Estimates (SAIPE) program. The poverty estimates 
produced by SAIPE have been used by the US Department of Education in 
allocating considerable federal funds each year to states and school districts. 
In 2008 the Department of Education used SAIPE estimates, directly and 
indirectly, to allocate approximately $16 billion to school districts. 

The Census 2000 long form had questions for eight different types of 
income for each individual in a household. (For details, see Table 1 in Sec- 
tion 5.) If there was nonresponse for an income item, a version of nearest 
neighbor imputation (NNI) was used, where the nearest neighbor was
determined by several factors such as response pattern, number of household
members, and other demographic characteristics. NNI is a type of hot deck 
imputation that selects the respondent closest, in some metric, to the non- 
respondent, and inserts the respondent value for the missing item. Most 
imputation rates for income items in the Census 2000 long form data were 
more than double the corresponding imputation rates from the 1990 cen- 
sus [Schneider (2004), pages 17-18, and Table 1, page 27]. For example, the 
Census 2000 imputation rate for wage and salary income was 20%, while in 
1990 it was 10%, and for interest and dividend income the imputation rates 
were 20.8% in 2000 and 8.1% in 1990. Overall, 29.7% of long form records in
2000 had at least some income imputed, compared to 13.4% in 1990. Given 
the 2000 imputation rates, it is important that variance estimates for income 
and poverty statistics reflect the uncertainty associated with the imputation 
of income items. 

The Census Bureau performed nearest neighbor imputation for eight in- 
come items in producing the long form estimates. The estimation procedure 
had been implemented and the estimates were not subject to revision. Our 
task was to estimate the variances of the existing long form point estimates 
that are used by the SAIPE program. The problem is challenging because 
of the complexity of the estimates. While total household income is a simple 
sum of the income items for persons in a household, and average household 
income (for states and counties) is a simple linear function of these quanti- 
ties, our interest centers on (i) median household income, and (ii) numbers 
of persons in poverty for various age groups. Poverty status is determined by 
comparing total family income to the appropriate poverty threshold, with 
the poverty status of each person in a family determined by the poverty 
status of the family. For such complicated functions of the data, the effects 
of imputation on variances are difficult to evaluate. 

It is well known that treating the imputed values as if they are observed 
and applying a standard variance formula leads to underestimation of the 
true variance. Variance estimation methods accounting for the effect of im- 
putation have been studied by Rubin (1987), Rao and Shao (1992), Shao 
and Steel (1999), and Kim and Fuller (2004), among others. Sande (1983) 
reviewed the NNI approach, Rancourt, Särndal, and Lee (1994) studied NNI
under a linear regression model, and Fay (1999) and Rancourt (1999) con- 
sidered variance estimation in some simple situations. Chen and Shao (2000) 
gave conditions under which the bias in NNI is small relative to the stan- 
dard error and proposed a model-based variance estimator. Chen and Shao 
(2001) described a jackknife variance estimator. Shao and Wang (2008) dis- 
cussed interval estimation and Shao (2009) proposed a simple nonparametric 
variance estimator. 

Our approach to estimating variances under NNI is based on the fractional
imputation approach suggested by Kalton and Kish (1984) and studied by 
Kim and Fuller (2004). In fractional imputation, multiple donors, say, M, are 
chosen for each recipient. We combine fractional imputation with the nearest 
neighbor criterion of selecting donors, modifying the variance estimation 
method described in Kim and Fuller (2004) to estimate the variance due 
to nearest neighbor imputation. Replication permits estimation of variances 
for parameters such as median household income and the poverty rate. Also, 
replication is used to incorporate the effect of raking, another feature of the 
estimation from the Census 2000 long form sample. 

It should be noted that the official estimation and imputation procedures 
for the long form were fixed and production was completed before the re- 
search described here was even started. Hence, our objective was to develop 
variance estimates, accounting for imputation and raking, for the produc- 
tion point estimates, not to explore alternative imputation procedures in 
an attempt to improve the point estimates. Thus, we used M = 2 nearest 
neighbor imputations in developing variance estimates for the production 
long form estimates that used M = 1 nearest neighbor imputation. 

The paper is organized as follows. In Section 2 the model for the NNI 
method and the properties of the NNI estimator are discussed. In Section 3 
a variance estimation method for the NNI estimator is proposed. In Section 4 
the proposed method is extended to stratified cluster sampling. In Section 5 
application of the approach to the Census 2000 long form income and poverty 
estimates is described. 

2. Model and estimator properties. Our finite universe U is the census 
population of the United States. The Census Bureau imputation procedure 
defines a measure of closeness for individuals. Let a neighborhood of
individual $g$ be composed of individuals that are close to individual $g$,
and let $B_g$ be the set of indices for the individuals in the neighborhood
of individual $g$. We assume that it is appropriate to approximate the
distribution of elements in the neighborhood by

(1)    $y_j \overset{\mathrm{i.i.d.}}{\sim} (\mu_g, \sigma_g^2), \qquad j \in B_g,$

where i.i.d. denotes independently and identically distributed. Chen and Shao
(2000) have given conditions such that it is possible to define a sequence of
samples, populations, and neighborhoods so that the distribution of $y_j$ can
be approximated by that of (1). See also Section B in the supplemental
article [Kim, Fuller, and Bell (2010)] for an alternative justification of (1).
These conditions do not necessarily hold for our population because the
neighbors are defined by discrete variables. If response is independent of $y$
and if the values of the discrete variables are the same for all elements in
$B_g$, then (1) holds when the original observations are independent. We feel
(1) is reasonable because the sample is large relative to a neighborhood
composed of three sample individuals. We assume that response is independent
of the $y$-values so that the distribution (1) holds for both recipients and
donors.

Let $\hat\theta_n$ be an estimator based on the full sample. We write an
estimator that is linear in $y$ as

    $\hat\theta_n = \sum_{i \in A} w_i y_i,$

where $A$ is the set of indices in the sample and the weight $w_i$ does not
depend on $y_i$. An example is the estimated total
$\hat T_y = \sum_{i \in A} \pi_i^{-1} y_i$, where $\pi_i$ is the
selection probability. Let $V(\hat\theta_n)$ be the variance of the full
sample estimator. Under model (1) we can write

    $y_i = \mu_i + e_i,$

where the $e_i$ are independent $(0, \sigma_i^2)$ random variables and
$\mu_i$ is the neighborhood mean. Thus, $\mu_i = \mu_g$ and
$\sigma_i^2 = \sigma_g^2$ for $i \in B_g$. Then, under model (1) and
assuming that the sampling design is ignorable under the model in the sense
of Rubin (1976), the variance of a linear estimator of the total
$T_y = \sum_{i \in U} y_i$ can be written

    $V\Big\{\sum_{i \in A} w_i y_i - T_y\Big\} = V\Big\{\sum_{i \in A} w_i \mu_i - \sum_{i \in U} \mu_i\Big\} + E\Big\{\sum_{i \in A} (w_i^2 - w_i)\sigma_i^2\Big\}.$

Assume that $y$ is missing for some elements and assume there are always at
least $M$ observations on $y$ in the neighborhood of each missing value, where
in the Census long form application, $M = 2$. Let an imputation procedure
be used to assign $M$ donors to each recipient. Let $w^*_{ij}$ be the fraction
of the original weight allocated to donor $i$ for recipient $j$, where
$\sum_i w^*_{ij} = 1$. If we define

    $d_{ij} = 1$ if $y_i$ is used as a donor for $y_j$, and $d_{ij} = 0$ otherwise,

then one common choice for $w^*_{ij}$ is $w^*_{ij} = M^{-1} d_{ij}$ for
$i \neq j$. Then

    $\alpha_i = w_i + \sum_{j \neq i} w_j w^*_{ij} = \sum_{j \in A} w_j w^*_{ij}$

is the total weight for donor $i$, where it is understood that $w^*_{ii} = 1$
for a donor donating to itself. Thus, the imputed linear estimator is

    $\hat\theta_I = \sum_{j \in A} w_j y_{Ij} = \sum_{i \in A_R} \alpha_i y_i,$

where $A_R$ is the set of indices for the respondents and the mean imputed
value for recipient $j$ is

(2)    $y_{Ij} = \sum_{i \in A_R} w^*_{ij} y_i.$

Note that $y_{Ij} = y_j$ if $j$ is a respondent. Then, under model (1),

(3)    $V(\hat\theta_I - T_y) = V\Big\{\sum_{i \in A_R} \alpha_i \mu_i - \sum_{i \in U} \mu_i\Big\} + E\Big\{\sum_{i \in A_R} (\alpha_i^2 - \alpha_i)\sigma_i^2\Big\},$

where $A_R$ is the set of indices of respondents. The variance expression (3)
is smaller for larger M, 1 < M < hr, as long as model (1) holds for the M 
nearest neighbors. See Kim and Fuller (2004). 
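
The donor weights and the imputed estimator of this section can be sketched in a few lines. This is a minimal illustration, not the Census Bureau procedure: closeness is measured here by the absolute difference in a single hypothetical covariate `x`, whereas the production metric uses response pattern, household size, and other characteristics. The function name and arguments are assumptions made for the example.

```python
import numpy as np

def fractional_nn_estimate(x, y, w, M=2):
    """Return (alpha, theta_I) for M-nearest-neighbor fractional imputation.

    x : covariate used only to measure closeness (an assumption of this sketch)
    y : study variable, with np.nan marking missing items
    w : sampling weights w_i
    """
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    resp = ~np.isnan(y)
    donors = np.flatnonzero(resp)
    # A respondent donates to itself with fraction w*_ii = 1.
    alpha = np.where(resp, w, 0.0)
    for j in np.flatnonzero(~resp):
        # The M respondents closest to recipient j (ties broken by index).
        order = np.argsort(np.abs(x[donors] - x[j]), kind="stable")
        nearest = donors[order[:M]]
        alpha[nearest] += w[j] / M          # w*_ij = d_ij / M
    theta_I = float(np.sum(alpha[resp] * y[resp]))
    return alpha, theta_I

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, np.nan, 30.0, 40.0, np.nan])
w = np.full(5, 2.0)
alpha, theta_I = fractional_nn_estimate(x, y, w)
```

Because each recipient's weight is split among its donors, the donor weights sum to the sum of the original weights, and the estimator equals the weighted total of the completed data set with each missing value replaced by the mean of its two donors.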

3. Variance estimation. Let the replication variance estimator for the 
complete sample be 

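(4)    $\hat V(\hat\theta) = \sum_{k=1}^{L} c_k (\hat\theta^{(k)} - \hat\theta)^2,$

where $\hat\theta$ is the full sample estimator, $\hat\theta^{(k)}$ is the
$k$th estimate of $\theta_N$ based on the observations included in the $k$th
replicate, $L$ is the number of replicates, and $c_k$ is a factor associated
with replicate $k$ determined by the replication method. Assume that the
variance estimator $\hat V(\hat\theta)$ is design unbiased for the sampling
variance of $\hat\theta$. If the missing $y_j$ are replaced in (4) with
$y_{Ij}$ of (2), the resulting variance estimator
$\hat V_{\mathrm{naive}}(\hat\theta)$ satisfies

(5)    $E\{\hat V_{\mathrm{naive}}(\hat\theta)\} = V\Big\{\sum_{i \in A_R} \alpha_i \mu_i - \sum_{i \in U} \mu_i\Big\} + E\Big\{\sum_{i \in A_R} \sum_{k=1}^{L} c_k (\alpha_{i1}^{(k)} - \alpha_i)^2 \sigma_i^2\Big\},$

where $\alpha_{i1}^{(k)} = \sum_{j \in A} w_j^{(k)} w^*_{ij}$ and $w_j^{(k)}$
is the weight for element $j$ in replicate $k$. The weights
$\alpha_{i1}^{(k)}$ are called the naive replication weights.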
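
To make the naive replication weights concrete, the sketch below computes them for a delete-one jackknife. The choices $c_k = (n-1)/n$ and replicate weights $w_j^{(k)} = n w_j/(n-1)$ for $j \neq k$ are assumptions appropriate for a simple random sample, not the long form design; the function name and the matrix layout of the fractions are likewise hypothetical.

```python
import numpy as np

def naive_jackknife(W_star, w, y):
    """Naive delete-one jackknife under fractional imputation.

    W_star[i, j] = w*_ij, the fraction of recipient j's weight given to donor i
                   (columns sum to one; W_star[j, j] = 1 for a respondent j).
    w            = full-sample weights.
    y            = study variable; entries for nonrespondents are ignored.
    """
    W_star = np.asarray(W_star, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(w)
    c = (n - 1) / n                        # assumed c_k for delete-one jackknife
    y = np.nan_to_num(np.asarray(y, dtype=float))
    alpha = W_star @ w                     # alpha_i = sum_j w_j w*_ij
    theta = alpha @ y
    w_rep = np.tile(w * n / (n - 1), (n, 1))
    np.fill_diagonal(w_rep, 0.0)           # element k is deleted in replicate k
    alpha_rep = w_rep @ W_star.T           # alpha_rep[k, i] = naive alpha_i1^(k)
    theta_rep = alpha_rep @ y
    v_naive = c * np.sum((theta_rep - theta) ** 2)
    phi = c * np.sum((alpha_rep - alpha) ** 2, axis=0)
    return theta, v_naive, phi

y_full = np.array([1.0, 2.0, 3.0, 4.0])
theta, v_naive, phi = naive_jackknife(np.eye(4), np.full(4, 0.25), y_full)
```

With complete data (`W_star` equal to the identity) and weights $1/n$, the result reduces to the familiar $s^2/n$ for the sample mean. The returned `phi` is the per-donor sum of squares of the naive replicate weights about the full-sample donor weight.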

We consider a procedure in which the individual $w^*_{ij}$ are modified for
the replicates, with the objective of creating an unbiased variance estimator.

Let $w_{ij}^{*(k)}$ be the replicated fractional weights of unit $j$ assigned
to donor $i$ at the $k$th replication. Letting

    $\hat\theta_I^{(k)} = \sum_{i \in A_R} \alpha_i^{(k)} y_i,$

where $\alpha_i^{(k)} = w_i^{(k)} + \sum_{j \neq i} w_j^{(k)} w_{ij}^{*(k)} = \sum_{j \in A} w_j^{(k)} w_{ij}^{*(k)}$,
define a variance estimator by

    $\hat V(\hat\theta_I) = \sum_{k=1}^{L} c_k (\hat\theta_I^{(k)} - \hat\theta_I)^2.$
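The expectation of the variance estimator $\hat V(\hat\theta_I)$ is

(6)    $E\{\hat V(\hat\theta_I)\} = E\Big[\sum_{k=1}^{L} c_k \Big\{\sum_{i \in A_R} (\alpha_i^{(k)} - \alpha_i)\mu_i\Big\}^2\Big] + E\Big[\sum_{i \in A_R} \Big\{\sum_{k=1}^{L} c_k (\alpha_i^{(k)} - \alpha_i)^2\Big\} \sigma_i^2\Big].$

Because the $w_{ij}^{*(k)}$ satisfy

(7)    $\sum_{i \in A_R} w_{ij}^{*(k)} = 1$

for all $j$, then, under the model (1), ignoring the smaller order terms,

    $E\Big[\sum_{k=1}^{L} c_k \Big\{\sum_{i \in A_R} (\alpha_i^{(k)} - \alpha_i)\mu_i\Big\}^2\Big] \doteq E\Big[\sum_{k=1}^{L} c_k \Big\{\sum_{i \in A} (w_i^{(k)} - w_i)\mu_i\Big\}^2\Big] = V\Big\{\sum_{i \in A} w_i \mu_i - \sum_{i \in U} \mu_i\Big\},$

which equals the first term of (3) because $\mu_i$ is constant within a
neighborhood. Thus, the bias of the variance estimator $\hat V(\hat\theta_I)$ is

    $\mathrm{Bias}\{\hat V(\hat\theta_I)\} = E\Big[\sum_{i \in A_R} \Big\{\sum_{k=1}^{L} c_k (\alpha_i^{(k)} - \alpha_i)^2 - (\alpha_i^2 - \alpha_i)\Big\} \sigma_i^2\Big].$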

If the replicated fractional weights were to satisfy

(8)    $\sum_{k=1}^{L} c_k (\alpha_i^{(k)} - \alpha_i)^2 = \alpha_i^2 - \alpha_i$

for all $i \in A_R$, then the bias would be zero. However, it is difficult to
define replicate weights that satisfy (8). Therefore, we consider the
requirement

(9)    $\sum_{k=1}^{L} c_k \Big\{(\alpha_i^{(k)} - \alpha_i)^2 + \sum_{t \in D_{Ri}} (\alpha_t^{(k)} - \alpha_t)^2\Big\} = \alpha_i^2 - \alpha_i + \sum_{t \in D_{Ri}} (\alpha_t^2 - \alpha_t),$

where $D_{Ri} = \{t;\ \sum_{j \in A_M} d_{ij} d_{tj} = 1,\ t \neq i\}$ is the
set of donors, other than $i$, to recipients from donor $i$, and $A_M$ denotes
the set of indices of the recipients. Under assumption (1), the recipients in
the neighborhood of donor $i$ have common variance and (9) is a sufficient
condition for unbiasedness.

We outline a replication variance estimator that assigns fractional repli- 
cate weights such that (7) and (9) are satisfied. There are three types of 
observations in the data set: (1) respondents that act as donors for at least
one recipient, (2) respondents that are never used as donors, and (3)
recipients. The naive replicate weights defined in (5) will be used for the
last two types. For donors, the fractional weights $w^*_{ij}$ in replicate $k$
will be modified to satisfy (7) and (9).

We first consider jackknife replicates formed by deleting a single element. 
The next section considers an extension to a grouped jackknife procedure. 
Let the superscript $k$ denote the replicate where element $k$ is deleted.
First the replicates for the naive variance estimator (5) are computed, and
the sum of squares for element $i$ is computed as

    $\sum_{k=1}^{L} c_k (\alpha_{i1}^{(k)} - \alpha_i)^2 = \phi_i,$

where $\alpha_{i1}^{(k)}$ is defined following (5).

In the second step the fractions for replicates for donors are modified. Let
the new fractional weight in replicate $k$ for the value donated by $k$ to $j$
be

(10)    $w_{kj}^{*(k)} = w_{kj}^{*}(1 - b_k),$

where $b_k$ is to be determined. Let $t$ be one of the other $M - 1$ donors,
other than $k$, that donate to $j$. Then, the new fractional weight for donor
$t$ is

(11)    $w_{tj}^{*(k)} = w_{tj}^{*} + (M - 1)^{-1} b_k w_{kj}^{*}.$

For $M = 2$ with $w_{kj}^{*} = w_{tj}^{*} = 0.5$, $w_{kj}^{*(k)} = 0.5(1 - b_k)$
and $w_{tj}^{*(k)} = 0.5(1 + b_k)$. For any choice of $b_k$, condition (7) is
satisfied. The variance estimator will be unbiased if $b_k$ satisfies

(12)    $c_k \Big\{\alpha_{k1}^{(k)} - \alpha_k - b_k \sum_{j \neq k} w_j^{(k)} w_{kj}^{*}\Big\}^2 - c_k (\alpha_{k1}^{(k)} - \alpha_k)^2$
        $+ \sum_{t \in D_{Rk}} c_k \Big\{\alpha_{t1}^{(k)} - \alpha_t + b_k (M - 1)^{-1} \sum_{j \neq k} w_j^{(k)} w_{kj}^{*} d_{tj}\Big\}^2 - \sum_{t \in D_{Rk}} c_k (\alpha_{t1}^{(k)} - \alpha_t)^2 = \alpha_k^2 - \alpha_k - \phi_k,$

where $D_{Rk}$ is defined following (9). The difference
$\alpha_k^2 - \alpha_k - \phi_k$ is the difference between the desired sum of
squares for observation $k$ and the sum of squares for the naive estimator.
Under the assumption of a common variance in a neighborhood and the assumption
that the variance estimator $\hat V(\hat\theta)$ of (4) is unbiased for the
full sample, the resulting variance estimator with $w_{ij}^{*(k)}$ defined by
(10)-(12) is unbiased for the imputed sample. An illustration of the
construction of replicates for variance estimation is provided in Section A of
the supplement [Kim, Fuller, and Bell (2010)].
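
The adjustment in (10) and (11) moves a share of the deleted donor's fraction to the other donors of the same recipient. The following sketch applies it to the fractions of one recipient and can be used to confirm that they still sum to one, as (7) requires. The function name and the dictionary representation of the fractions are assumptions made for the example; solving (12) for $b_k$ is not shown.

```python
def adjust_fractions(w_star_j, k, b_k):
    """Apply (10)-(11) to the fractions of one recipient j in replicate k.

    w_star_j : dict mapping donor index -> original fraction w*_ij (sums to one)
    k        : index of the deleted donor
    b_k      : adjustment factor, assumed already determined
    """
    M = len(w_star_j)
    if k not in w_star_j or M < 2:
        return dict(w_star_j)               # donor k does not donate to j
    shift = b_k * w_star_j[k] / (M - 1)     # share passed to each other donor
    return {i: f * (1.0 - b_k) if i == k else f + shift
            for i, f in w_star_j.items()}

# M = 2 with equal fractions, as in the long form application.
adjusted = adjust_fractions({"k": 0.5, "t": 0.5}, "k", 0.2)
```

For $M = 2$ and $b_k = 0.2$ the deleted donor keeps $0.5(1 - 0.2)$ of the recipient's weight and the other donor receives $0.5(1 + 0.2)$.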

4. Extension. The method proposed in Section 3 was described for
the situation where the jackknife replicates are formed by deleting a single
element. In practice, a grouped jackknife is commonly used, where the jackknife
replicates are created by deleting a group of elements. The group can
be the primary sampling units (PSU) or, as in the Census long form case, 
groups are formed to reduce the number of replicates. In the discussion we 
use the term PSU to denote the group. To extend the proposed method, 
assume that we have a sample composed of PSUs and let PSU k be deleted 
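to form a replicate. Let $P_k$ be the indices of the set of donors in PSU $k$
that donate to a recipient in a different PSU. For fractional imputation of
size $M$, let the fractional replication weight in replicate $k$ for the value
donated by element $i$ in PSU $k$ to $j$ be

(13)    $w_{ij}^{*(k)} = w_{ij}^{*}(1 - b_k)$   if $i \in P_k$ and $M \neq M_{jk}$,

where $b_k$ is to be determined and $M_{jk} = \sum_{i \in P_k} d_{ij}$ is the
number of donors to recipient $j$ that are in PSU $k$. Note that (13) is a
generalization of (10). The corresponding replication fraction for a donor to
a recipient $j$, where the donor is not in PSU $k$, is

    $w_{tj}^{*(k)} = w_{tj}^{*}(1 + \Delta_{jk} b_k d_{ij})$   for $t \in P_k^c$ and $i \in P_k$,

where

    $\Delta_{jk} = \frac{\sum_{i \in P_k} w_{ij}^{*}}{\sum_{t \in P_k^c} w_{tj}^{*}}$

is the ratio that keeps the replicate fractions for recipient $j$ summing to
one, so that (7) holds. The determining equation for $b_k$ is

    $\sum_{i \in P_k} c_k \Big[\Big\{\alpha_{i1}^{(k)} - \alpha_i - b_k \sum_{j} w_j^{(k)} w_{ij}^{*}\Big\}^2 - (\alpha_{i1}^{(k)} - \alpha_i)^2\Big]$
    $+ \sum_{i \in P_k} \sum_{t \in D_{Ri}} c_k \Big[\Big\{\alpha_{t1}^{(k)} - \alpha_t + b_k \sum_{j} w_j^{(k)} d_{ij} \Delta_{jk} w_{tj}^{*}\Big\}^2 - (\alpha_{t1}^{(k)} - \alpha_t)^2\Big] = \sum_{i \in P_k} \{\alpha_i^2 - \alpha_i - \phi_i\},$

where the inner sums are over recipients $j$ with $M \neq M_{jk}$. This
equation generalizes (12). Here, we assume common variances for the units in
the same PSU.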

We extend the fractional nearest neighbor imputation to the case of $M_1$
fractions for point estimation and $M_2$ ($> M_1$) fractions for variance
estimation. The motivation for this extension is the application to the Census
long form where the official estimates are based on a single imputed value.
A second imputed value was generated to be used only in variance estimation.
Let $d_{1ij}$ and $d_{2ij}$ be the donor-recipient relationship indicator
functions used for point estimation and for variance estimation, respectively.
Also, let $w^*_{1ij}$ and $w^*_{2ij}$ be the fractional weights of recipient
$j$ from donor $i$ that are computed from $d_{1ij}$ and $d_{2ij}$,
respectively. For missing unit $j$, one common choice is
$w^*_{1ij} = d_{1ij} M_1^{-1}$ and $w^*_{2ij} = d_{2ij} M_2^{-1}$. Of
particular interest is the case where $M_1 = 1$ and $M_2 = 2$.

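If $M_1 \neq M_2$, the variance estimator is defined by

(14)    $\hat V(\hat\theta_I) = \sum_{k=1}^{L} c_k (\hat\theta_{I2}^{(k)} - \hat\theta_{I1})^2,$

where

    $(\hat\theta_{I2}^{(k)}, \hat\theta_{I1}) = \sum_{i \in A_R} (\alpha_{i2}^{(k)}, \alpha_{i1})\, y_i,$

with $\alpha_{i2}^{(k)} = \sum_{j} w_j^{(k)} w_{2ij}^{*(k)}$ and
$\alpha_{i1} = \sum_{j} w_j w^*_{1ij}$. Here, $w_{2ij}^{*(k)}$ is the
replicated fractional weight of unit $j$ assigned to donor $i$ in the $k$th
replication. Note that $\hat\theta_{I1}$ is based on the point estimation
weights and $\hat\theta_{I2}^{(k)}$ is based on the variance estimation
weights. If $w_{2ij}^{*(k)}$ satisfy (7), the bias of the variance estimator
(14) is

    $\mathrm{Bias}\{\hat V\} = E\Big[\sum_{i \in A_R} \Big\{\sum_{k=1}^{L} c_k (\alpha_{i2}^{(k)} - \alpha_{i1})^2 - (\alpha_{i1}^2 - \alpha_{i1})\Big\} \sigma_i^2\Big].$

Thus, condition (9) for the unbiasedness of the variance estimator is changed
to

(15)    $\sum_{k=1}^{L} c_k \Big\{(\alpha_{i2}^{(k)} - \alpha_{i1})^2 + \sum_{t \in D_{Ri}} (\alpha_{t2}^{(k)} - \alpha_{t1})^2\Big\} = \alpha_{i1}^2 - \alpha_{i1} + \sum_{t \in D_{Ri}} (\alpha_{t1}^2 - \alpha_{t1}).$

To create the replicated fractional weights satisfying (7) and (15), the
sum of squares of the naive replication weights is first computed,

    $\sum_{k=1}^{L} c_k (\alpha_{i1}^{(k)} - \alpha_{i1})^2 = \phi_{i1}, \qquad i \in A_R,$

where $\alpha_{i1}^{(k)} = \sum_{j \in A} w_j^{(k)} w^*_{1ij}$. In the second
step the fractions for replicates for donors in the point estimation are
modified. Let the new fractional weight in replicate $k$ for the value donated
by $i \in P_k$ to $j$ be

    $w_{2ij}^{*(k)} = w^*_{1ij}(1 - b_k)$,   if $i \in P_k$ and $M_2 \neq M_{2jk}$,

where $b_k$ is to be determined and $M_{2jk} = \sum_{i \in P_k} d_{2ij}$. Now,
$M_2$ ($> M_1$) donors are identified for variance estimation. The new
fractional weight for the other $M_2 - 1$ donors to recipient $j$, denoted by
$t$, is

(16)    $w_{2tj}^{*(k)} = w^*_{1tj} + \Delta_{jk} b_k d_{1ij} w^*_{2tj}$   for $t \in P_k^c$ and $i \in P_k$,

where

    $\Delta_{jk} = \frac{\sum_{i \in P_k} w^*_{1ij}}{\sum_{t \in P_k^c} w^*_{2tj}}$

keeps the replicate fractions for recipient $j$ summing to one. Then the
$b_k$ that gives the correct sum of squares is the solution to the quadratic
equation

    $\sum_{i \in P_k} c_k \Big[\Big\{\alpha_{i1}^{(k)} - \alpha_{i1} - b_k \sum_{j} w_j^{(k)} w^*_{1ij}\Big\}^2 - (\alpha_{i1}^{(k)} - \alpha_{i1})^2\Big]$
    $+ \sum_{i \in P_k} \sum_{t \in D_{Ri}} c_k \Big[\Big\{\alpha_{t1}^{(k)} - \alpha_{t1} + b_k \sum_{j} w_j^{(k)} \Delta_{jk} d_{1ij} w^*_{2tj}\Big\}^2 - (\alpha_{t1}^{(k)} - \alpha_{t1})^2\Big] = \sum_{i \in P_k} \{\alpha_{i1}^2 - \alpha_{i1} - \phi_{i1}\}.$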



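If $M_1 = 1$, the adjustment in the replication fractional weights can be
made at the individual level. Let the new fractional weight in replicate $k$
for the value donated by $i \in P_k$ to $j$, $j \in P_k^c$, be

    $w_{2ij}^{*(k)} = w^*_{1ij}(1 - b_i)$,   if $i \in P_k$ and $M_2 \neq M_{2jk}$,

where $b_i$ is to be determined. The new fractional weight for each of the
other $M_2 - 1$ donors to recipient $j$, denoted by $t$, is

    $w_{2tj}^{*(k)} = w^*_{1tj} + \Delta_{jk} b_i d_{1ij} w^*_{2tj}$   for $t \in P_k^c$ and $i \in P_k$,

where $\Delta_{jk}$ is defined following (16). Then the $b_i$ that gives the
correct sum of squares is the solution to the quadratic equation

    $c_k \Big\{\alpha_{i1}^{(k)} - \alpha_{i1} - b_i \sum_{j} w_j^{(k)} w^*_{1ij}\Big\}^2 - c_k (\alpha_{i1}^{(k)} - \alpha_{i1})^2$
    $+ \sum_{t \in D_{Ri}} c_k \Big[\Big\{\alpha_{t1}^{(k)} - \alpha_{t1} + b_i \sum_{j} w_j^{(k)} \Delta_{jk} d_{1ij} w^*_{2tj}\Big\}^2 - (\alpha_{t1}^{(k)} - \alpha_{t1})^2\Big] = \alpha_{i1}^2 - \alpha_{i1} - \phi_{i1}.$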
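
As a worked illustration of the case $M_1 = 1$, $M_2 = 2$, suppose recipient $j$ has its first (production) donor $i$ in the deleted group $P_k$ and its second donor $t$ outside it. The calculation below assumes that $\Delta_{jk}$ is the ratio that makes the replicate fractions sum to one, which is how the factor is read here; it is a sketch of the arithmetic, not a statement of the production algorithm.

```latex
% Point estimation fractions: w*_{1ij} = 1, w*_{1tj} = 0.
% Variance estimation fractions: w*_{2ij} = w*_{2tj} = 1/2.
\begin{align*}
\Delta_{jk} &= \frac{w^*_{1ij}}{w^*_{2tj}} = \frac{1}{1/2} = 2, \\
w^{*(k)}_{2ij} &= w^*_{1ij}(1 - b_i) = 1 - b_i, \\
w^{*(k)}_{2tj} &= w^*_{1tj} + \Delta_{jk}\, b_i\, d_{1ij}\, w^*_{2tj}
               = 0 + 2\, b_i \cdot 1 \cdot \tfrac12 = b_i, \\
w^{*(k)}_{2ij} + w^{*(k)}_{2tj} &= (1 - b_i) + b_i = 1 .
\end{align*}
```

In replicates where the first donor is not deleted the fractions stay at $(1, 0)$, so the second donor affects only the variance estimate and never the production point estimate.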



5. Application to US Census long form data. 



5.1. Introduction. We use long form data from the states of Delaware and 
Michigan to provide examples of the variance estimation methods. Table 1 
shows the individual income items and their state level imputation rates for 
Delaware and Michigan. 
Table 1
Imputation rate and the person-level average income for each income item
(age > 15) for two states, Delaware (n = 87,280) and Michigan (n = 1,412,339)

                               Delaware                  Michigan
                        Imputation    Average     Imputation    Average
Income item               rate (%)     income       rate (%)     income

Wage                         20        21,892          21        20,438
Self employment              10          1286          10          1234
Interest                     22          1989          22          1569
Social security              20          1768          20          1672
Supplemental security        20           125          20           148
Public assistance            19            38          19            47
Retirement                   20          2018          20          1664
Other                        19           543          19           529
Total                        31        29,659          31        27,301




The sampling design for the Census 2000 long form used stratified sys- 
tematic sampling of households, with four strata in each state. Sampling 
rates varied from 1 in 2 for very small counties and small places to 1 in 8 
for very populous areas. 

The weighting procedure for the Census 2000 long form was performed 
separately for person estimates and for housing unit estimates. For the in- 
come and poverty estimates considered here, the person weights are needed. 

The census long form person weights are created in two steps. In the first 
step, the initial weights are computed as the ratio of the population size (ob- 
tained from the 100% population counts) to the sample size in each cell of 
a cross-classification of final weighting areas (FWAs) by person types
[Housing unit person, Service Based Enumeration (SBE) person, other Group
Quarters (GQ) person]. Thus, the initial weights take the form of
poststratification weights. The second step in the weighting is raking, where, for
person weights, there are four dimensions in the raking. The dimensions are 
household type and size (21 categories), sampling type (3 categories), house- 
holder classification (2 categories), and Hispanic origin/race/sex/age (312 
categories). Therefore, the total number of possible cells is 39,312, although 
many cells in a FWA will be empty. The raking procedure is performed 
within each FWA. There are about 60,000 FWAs in the whole country and 
the FWAs are nested within counties. 
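
The raking step can be illustrated with a two-dimensional version of the iterative proportional adjustment. The sketch below is a generic raking routine, not the production Census Bureau program: it uses two dimensions instead of four, assumes every category contains at least one sample unit, and all names are hypothetical. The cell count quoted above is checked at the end.

```python
import numpy as np

def rake(weights, row_codes, col_codes, row_totals, col_totals,
         max_iter=100, tol=1e-8):
    """Rake weights so weighted margins match the row and column controls."""
    w = np.asarray(weights, dtype=float).copy()
    row_totals = np.asarray(row_totals, dtype=float)
    col_totals = np.asarray(col_totals, dtype=float)
    for _ in range(max_iter):
        # Adjust to each dimension in turn.
        for codes, totals in ((row_codes, row_totals), (col_codes, col_totals)):
            margin = np.bincount(codes, weights=w, minlength=len(totals))
            w *= (totals / margin)[codes]
        # Columns match exactly after the last pass; check the rows.
        row_margin = np.bincount(row_codes, weights=w, minlength=len(row_totals))
        if np.max(np.abs(row_margin - row_totals)) < tol:
            break
    return w

# Four raking dimensions: 21 x 3 x 2 x 312 possible cells.
n_cells = 21 * 3 * 2 * 312

raked = rake(np.ones(4), np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1]),
             [30, 70], [40, 60])
```

In the variance calculation the same adjustment is repeated on each set of replicate weights, which is how the replication method carries the effect of raking into the variance estimate.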

5.2. Computational details. The variance estimation methodology is ba- 
sed on the grouped jackknife, where the method described in Section 3 is 
used to estimate the variance due to imputation. We summarize the main 
steps of variance estimation and then discuss the steps in more detail: 
Step 1: Create groups and then define initial replication weights for the
grouped jackknife method. The elements within a stratum are systemat- 
ically divided into groups. A replicate is created by deleting a group. 

Step 2: Using the initial replication weights, repeat the weighting procedure 
to compute the final weights for each replicate. 

Step 3: Using fractional weighting, modify the replicate weights to account 
for the imputation effect on the variance. In the process, a replicate im- 
puted total income variable is created for each person with missing data. 

Step 4: Using the replicate total income variables, compute the jackknife 
variance estimates for parameters such as the number of poor people by 
age group and the median household income. 

In step 1, the sample households in a final weighting area are sorted by 
their identification numbers, called MAFIDs. Let n be the sample number 
of households in a final weighting area. The first n/50 sample households 
are assigned to variance stratum 1, the next n/50 sample households are 
assigned to variance stratum 2, and so on, to create 50 variance strata. 
Within each variance stratum, the sample households are further grouped 
into two groups by a systematic sample of households arranged in a half- 
ascending-half-descending order based on the MAFID. Using the two groups 
in each of the 50 strata, L = 100 replication factors are assigned to each 
unit in the sample. For unit $i$ in variance stratum $h$ ($h = 1, 2, \ldots, 50$),
the replication factor for the replicate formed by deleting group $k$ in
variance stratum $h$ is

    $1$,              if unit $i$ does not belong to variance stratum $h$,
    $2 - \delta_i$,   if unit $i$ belongs to variance stratum $h$ and $i \notin P_{hk}$,
    $\delta_i$,       if unit $i \in P_{hk}$,

where $\delta_i = 1 - \{(1 - 1/w_{i0})0.5\}^{1/2}$, $w_{i0}$ is the initial
weight of unit $i$, and $P_{hk}$ is the set of sample indices in group $k$ in
variance stratum $h$. With this replication factor, $c_k$ of (4) is one.
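
The replication factor can be written as a small function. This is a sketch of the rule as read from the definition above; the function name and boolean arguments are hypothetical.

```python
import math

def replication_factor(w0, in_stratum, in_deleted_group):
    """Replication factor for one unit in one replicate.

    w0               : initial weight w_i0 of the unit (at least 1)
    in_stratum       : True if the unit is in the variance stratum of the replicate
    in_deleted_group : True if the unit is in the deleted group P_hk
    """
    if not in_stratum:
        return 1.0
    delta = 1.0 - math.sqrt((1.0 - 1.0 / w0) * 0.5)
    return delta if in_deleted_group else 2.0 - delta

factors_w2 = (replication_factor(2.0, True, True),
              replication_factor(2.0, True, False))
```

The term $1 - 1/w_{i0}$ acts like a finite population correction: a unit with initial weight one has factor one in every replicate and so contributes nothing to the variance, while for large initial weights $\delta_i$ approaches $1 - \sqrt{0.5}$.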

In step 2, the step 1 replication weights are modified using the production 
raking operation. The weighting procedure consists of two parts. The first 
part is a poststratification in each final weighting area and the second part 
is raking ratio estimation using the short form population totals as controls. 
If the raking was carried to convergence, the estimated variance for controls 
would be zero. In the actual operation, the replicated final weights produce 
very small variance estimates for the estimates of the population controls. 

In step 3, a second nearest neighbor is identified for each nonrespondent 
for each income item. There are eight income items (see Table 1).
A fractional weight of one is assigned to the imputed value from the
first donor and a fractional weight of zero is assigned to the imputed value
from the second donor for production estimation. The fractional weights
are changed for the replicate when the jackknife group containing the first
donor is deleted. The amount of change is determined so that conditions (7)
and (9) are satisfied. Replicate fractional weights are constructed separately
for each income item.

Once the replicated fractional weights are computed, replicates of the
person-level total income are constructed. Let $Y_{tis}$ be the $s$th income
item for person $i$ in family $t$ and let $R_{tis}$ be the response indicator
function for $Y_{tis}$. For the $k$th replicate, the replicated total income
for person $i$ in family $t$ is

(17)    $TINC_{ti}^{(k)} = \sum_{s=1}^{8} \{R_{tis} Y_{tis} + (1 - R_{tis}) \hat Y_{tis}^{(k)}\},$

where $\hat Y_{tis}^{(k)}$ is the $k$th replicate of the imputed value for
$Y_{tis}$, defined by

    $\hat Y_{tis}^{(k)} = w_{tisa}^{*(k)} Y^*_{tisa} + w_{tisb}^{*(k)} Y^*_{tisb},$

$(w_{tisa}^{*(k)}, w_{tisb}^{*(k)})$ is the vector of the two $k$th replicate
fractional weights, one for the first donor and one for the second donor, for
the $s$th income item, and $(Y^*_{tisa}, Y^*_{tisb})$ is the vector of the
imputed values of $Y_{tis}$ from the first and second donor, respectively. The
$k$th replicate of total family income for family $t$ is

(18)    $TINC_{t}^{(k)} = \sum_{i=1}^{m_t} TINC_{ti}^{(k)},$

where $m_t$ is the number of people in family $t$ and $TINC_{ti}^{(k)}$ is
defined in (17).
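
Equations (17) and (18) amount to replacing each missing item by a weighted combination of its two donor values and summing. A minimal sketch follows; the function names and the array layout (one length-8 array per person) are assumptions made for illustration.

```python
import numpy as np

def replicate_person_income(Y, R, Ya, Yb, wa_k, wb_k):
    """Replicated total income of one person, as in (17).

    Y          : reported income items (ignored where R == 0)
    R          : response indicators (1 = reported, 0 = imputed)
    Ya, Yb     : imputed values from the first and second donor
    wa_k, wb_k : replicate fractional weights for the two donors
    """
    Y, R, Ya, Yb, wa_k, wb_k = (np.asarray(a, dtype=float)
                                for a in (Y, R, Ya, Yb, wa_k, wb_k))
    imputed_k = wa_k * Ya + wb_k * Yb
    return float(np.sum(np.where(R == 1, Y, imputed_k)))

def replicate_family_income(person_incomes):
    """Replicated total family income, as in (18)."""
    return float(sum(person_incomes))

# One person: wage reported, interest missing, other items reported as zero.
Y = np.array([20000.0, np.nan, 0, 0, 0, 0, 0, 0])
R = np.array([1, 0, 1, 1, 1, 1, 1, 1])
Ya = np.array([0, 1000.0, 0, 0, 0, 0, 0, 0])
Yb = np.array([0, 2000.0, 0, 0, 0, 0, 0, 0])
wa = np.array([1, 0.8, 1, 1, 1, 1, 1, 1])
tinc_k = replicate_person_income(Y, R, Ya, Yb, wa, 1 - wa)
```

When the replicate fractional weights are $(1, 0)$ the function returns the production total income based on the first donor only.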

For the age group poverty estimates, a poverty status indicator function is
defined for the family, and applies to all family members. That is, all family
members are either in poverty or all are not in poverty. The poverty status
indicator for family $t$ is defined as

    $\zeta_t = 1$ if $TINC_t < c_t$, and $\zeta_t = 0$ if $TINC_t \geq c_t$,

where, as with the replicates in (17),

    $TINC_t = \sum_{i=1}^{m_t} \sum_{s=1}^{8} \{R_{tis} Y_{tis} + (1 - R_{tis}) Y^*_{tisa}\}$

is the total income of family $t$, where $Y^*_{tisa}$ is the imputed value for
$Y_{tis}$ using the first nearest donor, and $c_t$ is the poverty threshold
value for family $t$. The threshold is a function of the number of related
children under 18 years of age, the size of the family unit, and the age of
the householder. (Poverty thresholds for all recent years are available on the
Census Bureau web site at
http://www.census.gov/hhes/www/poverty/threshld.html.)

To compute the replicate of $\zeta_t$, we use the following procedure:

1. For person $i$ in family $t$, compute two total incomes, $TINC_{tia}$ and
   $TINC_{tib}$, by

       $TINC_{tia} = \sum_{s=1}^{8} \{R_{tis} Y_{tis} + (1 - R_{tis}) Y^*_{tisa}\},$

       $TINC_{tib} = \sum_{s=1}^{8} \{R_{tis} Y_{tis} + (1 - R_{tis}) Y^*_{tisb}\}.$

   Also, compute the two total family incomes

       $(TINC_{ta}, TINC_{tb}) = \sum_{i=1}^{m_t} (TINC_{tia}, TINC_{tib}).$

   Using the replicated total family income $TINC_t^{(k)}$ defined in (18),
   define

(19)   $\alpha_t^{(k)} = \frac{TINC_t^{(k)} - TINC_{tb}}{TINC_{ta} - TINC_{tb}}$   if $TINC_{ta} \neq TINC_{tb}$,

   and $\alpha_t^{(k)} = 1$ otherwise. The $\alpha_t^{(k)}$ is the weight
   satisfying

       $TINC_t^{(k)} = \alpha_t^{(k)} TINC_{ta} + (1 - \alpha_t^{(k)}) TINC_{tb}.$

2. The replicated poverty status variable is now computed by

(20)   $\zeta_t^{(k)} = \alpha_t^{(k)} POV_{ta} + (1 - \alpha_t^{(k)}) POV_{tb},$

   where $POV_{ta}$ is computed by

       $POV_{ta} = 1$ if $TINC_{ta} < c_t$, and $POV_{ta} = 0$ if $TINC_{ta} \geq c_t$,

   and $POV_{tb}$ is computed similarly using $TINC_{tb}$.

The replication adjustment $\alpha_t^{(k)}$ is computed from family-level
total income and is applied in (20) to get a replicated poverty estimate.
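
The two-step procedure can be sketched for a single family and a single replicate. The function name and the example incomes and threshold are hypothetical.

```python
def replicate_poverty(tinc_k, tinc_a, tinc_b, threshold):
    """Return (alpha, zeta) of (19) and (20) for one family and one replicate.

    tinc_k    : replicated total family income from (18)
    tinc_a    : family income using the first donor for every imputed item
    tinc_b    : family income using the second donor for every imputed item
    threshold : poverty threshold c_t for the family
    """
    if tinc_a == tinc_b:
        alpha = 1.0
    else:
        alpha = (tinc_k - tinc_b) / (tinc_a - tinc_b)
    pov_a = 1.0 if tinc_a < threshold else 0.0
    pov_b = 1.0 if tinc_b < threshold else 0.0
    return alpha, alpha * pov_a + (1.0 - alpha) * pov_b

# First donor puts the family below the threshold, the second donor above it.
alpha_k, zeta_k = replicate_poverty(17000.0, 15000.0, 25000.0, 20000.0)
```

The replicated poverty status is a fraction between the two donor-specific statuses, so a replicate changes the poverty count only for families whose two donors place them on opposite sides of the threshold.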

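The estimated variance for the estimated total number of people in poverty
is

(21)    $\hat V_p = \sum_{k=1}^{L} (\hat\theta_p^{(k)} - \hat\theta_p^{(\cdot)})^2,$

where $L$ is the number of replications (here $L = 100$),

    $\hat\theta_p^{(k)} = \sum_{t=1}^{n} \sum_{i=1}^{m_t} w_{ti}^{(k)} \zeta_t^{(k)}, \qquad \hat\theta_p^{(\cdot)} = \frac{1}{L} \sum_{k=1}^{L} \hat\theta_p^{(k)},$

$\zeta_t^{(k)}$ is defined in (20), and $w_{ti}^{(k)}$ is the person level
replication weight after the raking operation.

The number of people in poverty in a given age group can be estimated by

    $\hat\theta_z = \sum_{t=1}^{n} \sum_{i=1}^{m_t} w_{ti} z_{ti} \zeta_t,$

where $z_{ti} = 1$ if the person $i$ in family $t$ belongs to the age group
and $z_{ti} = 0$ otherwise. The $k$th replicate of the estimate is

    $\hat\theta_z^{(k)} = \sum_{t=1}^{n} \sum_{i=1}^{m_t} w_{ti}^{(k)} z_{ti} \zeta_t^{(k)},$

and the variance is estimated by (21) using $\hat\theta_z^{(k)}$ defined above.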
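
The replicate estimates and the variance in (21) can be computed with array operations. The sketch assumes that (21) centers the replicates at their average and uses $c_k = 1$; a version centered at the full-sample estimate would differ slightly. The function names and the person-level array layout are hypothetical.

```python
import numpy as np

def poverty_count_replicates(w_rep, zeta_rep, z):
    """Replicate estimates of the number of poor persons in an age group.

    w_rep[k, p]    : raked replicate weight of person p in replicate k
    zeta_rep[k, p] : replicated poverty status of person p's family
    z[p]           : 1 if person p belongs to the age group, else 0
    """
    return np.sum(np.asarray(w_rep) * np.asarray(zeta_rep) * np.asarray(z), axis=1)

def jackknife_variance(theta_reps):
    """Sum of squared deviations of the replicates from their mean (c_k = 1)."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    return float(np.sum((theta_reps - theta_reps.mean()) ** 2))

w_rep = np.array([[10.0, 10.0, 10.0],
                  [12.0,  8.0, 10.0]])
zeta_rep = np.array([[1.0, 0.0, 1.0],
                     [0.5, 0.0, 1.0]])
z = np.array([1, 1, 0])
theta_reps = poverty_count_replicates(w_rep, zeta_rep, z)
v_hat = jackknife_variance(theta_reps)
```

Setting `z` to all ones gives the replicates for the total number of people in poverty.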

The variance estimation for median household income estimates is based
on the test-inversion methodology described in Francisco and Fuller (1991).
Also, see Woodruff (1952). Let $MED$ be the estimated median household
income defined by $MED = \hat F^{-1}(0.5)$, where $\hat F(\cdot)$ is the
estimated cumulative distribution function of total income of the household,

    $\hat F(u) = \Big(\sum_{t=1}^{n} w_{tt}\Big)^{-1} \sum_{t=1}^{n} w_{tt}\, I(TINC_t \leq u),$

$w_{tt}$ is the householder's person weight in household $t$, and $TINC_t$ is
the total income of household $t$. (Note that households differ from families.
The former includes all persons living in a given housing unit; the latter
includes only related persons living in a housing unit.)

To apply the test-inversion method, first create the replicated indicator
variable

    $INV_t^{(k)} = \alpha_t^{(k)} INV_{ta} + (1 - \alpha_t^{(k)}) INV_{tb},$

where $\alpha_t^{(k)}$ is defined in (19) and

    $INV_{ta} = 1$ if $\sum_{i=1}^{m_t} TINC_{tia} \leq MED$, and $INV_{ta} = 0$ if $\sum_{i=1}^{m_t} TINC_{tia} > MED$,

and $INV_{tb}$ is computed similarly, using $TINC_{tib}$ instead of
$TINC_{tia}$ in the above expressions.

The estimated variance of the estimated proportion $\hat F(MED) = 0.5$ is
computed by applying the variance formula (21) using $INV_t^{(k)}$ instead of
$\zeta_t^{(k)}$ to get $\hat V_{inv}$. Define

    $(\hat p_1, \hat p_2) = (0.5 - 2\sqrt{\hat V_{inv}},\ 0.5 + 2\sqrt{\hat V_{inv}})$

to be an approximate 95% confidence interval for the estimated proportion
$\hat F(MED) = 0.5$. The estimated variance of the estimated median is

    $\hat V_{med} = \{\hat F^{-1}(\hat p_2) - \hat F^{-1}(\hat p_1)\}^2 / 16.$

Table 2
Variance estimation results for Delaware and Michigan

                                                    Delaware             Michigan
Parameter                           Method       Est. SE  Std. SE     Est. SE  Std. SE

$\theta_1$ (total in poverty)       Naive           870     100         3217     100
                                    Imputation     1161     133         4096     127
$\theta_2$ (0-4 in poverty)         Naive           221     100          776     100
                                    Imputation      260     118          897     116
$\theta_3$ (5-17 related in         Naive           366     100         1314     100
  poverty)                          Imputation      467     128         1640     125
$\theta_4$ (0-17 in poverty)        Naive           458     100         1608     100
                                    Imputation      592     129         2062     128
Median HH income                    Naive           177     100           70     100
                                    Imputation      207     117           85     121

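
The test-inversion step for the median described in Section 5.2 can be sketched as follows. The inverse of the estimated distribution function is taken here as the smallest observed income whose weighted cumulative proportion reaches $p$, which is one common convention and an assumption of this example; the function names are hypothetical.

```python
import numpy as np

def weighted_quantile(values, weights, p):
    """Smallest value whose weighted cumulative proportion is at least p."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    idx = np.searchsorted(cdf, p, side="left")
    return float(v[min(idx, len(v) - 1)])

def median_variance(values, weights, v_inv):
    """Variance of the median from the variance v_inv of F(MED)."""
    half_width = 2.0 * np.sqrt(v_inv)
    p1 = max(0.5 - half_width, 0.0)
    p2 = min(0.5 + half_width, 1.0)
    q1 = weighted_quantile(values, weights, p1)
    q2 = weighted_quantile(values, weights, p2)
    # The interval (q1, q2) spans about four standard errors of the median.
    return (q2 - q1) ** 2 / 16.0

incomes = np.arange(1.0, 11.0)
v_med = median_variance(incomes, np.ones(10), 0.0016)
```

Dividing the squared interval length by 16 treats the interval as the estimate plus or minus two standard errors, which matches the factor 2 used to form $(\hat p_1, \hat p_2)$.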

5.3. Numerical results. Variance estimates for the long form income and 
poverty estimates that have been used by SAIPE were computed for all 
50 states of the US (plus DC) and their counties. The estimates considered 
here are the total number of people in poverty, the number of children under 
age 5 in poverty (state level only), the number of related children age 5-17 
in families in poverty, the number of children under age 18 in poverty, and 
the median household income. 

Table 2 contains variance estimation results (the estimated standard de- 
viations) for the income and poverty statistics for the states of Delaware and 
Michigan. The variance estimator labeled "naive" treats the imputed values 
as observed values. The "imputation" variance estimator is that of Section 3 
and reflects the imputation effects. Both variance estimators account for the 
raking in the estimator. Because Michigan is much larger than Delaware, 
its estimated numbers of persons in poverty (not shown) are much larger, 
and thus, due to the scale effects, so are the corresponding standard errors. 
The standardized standard errors in the table express each estimated
standard error as a percentage of the estimated standard error computed by
the "naive" procedure, so the naive entries equal 100 by construction.
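As a check on how the standardized standard errors are formed, the short Python sketch below recomputes them for Delaware from the estimated standard errors reported in Table 2. The dictionary layout and the function name are illustrative choices, and rounding to the nearest integer percentage is an assumption about how the published column was produced.

```python
# Estimated standard errors for Delaware, copied from Table 2.
delaware_se = {
    "total in poverty": {"naive": 870, "imputation": 1161},
    "0-4 in poverty": {"naive": 221, "imputation": 260},
    "5-17 related in poverty": {"naive": 366, "imputation": 467},
    "0-17 in poverty": {"naive": 458, "imputation": 592},
    "median HH income": {"naive": 177, "imputation": 207},
}

def standardized_se(se, naive_se):
    """Standard error as a rounded percentage of the naive standard error."""
    return round(100 * se / naive_se)

# Standardized standard error of the "imputation" estimator for each parameter.
std_se = {
    name: standardized_se(row["imputation"], row["naive"])
    for name, row in delaware_se.items()
}
```

Under these assumptions the recomputed values agree with the Delaware "Std. SE" column of Table 2.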

Generally speaking, imputation increases the variance so the naive vari- 
ance estimator underestimates the true variance. The relative increase is 
similar for Michigan and Delaware. A result worth noting is that the in- 
crease in variance due to imputation is higher for the poverty parameters 
than for the income parameters. This is because in both states the imputa- 
tion rate is higher for persons with low imputed income. (See Table 3.) 

Table 3
Imputation rates by income level (age >15)

                        Imputation rate (%)
Total income            Delaware    Michigan
0-9999                        34          34
10,000-19,999                 36          35
20,000-49,999                 28          29
50,000-69,999                 25          25
70,000 and over               25          25

Table 4 contains some numerical results for the estimated standard errors 
for the county estimates in Delaware. The age groups in the table are those 
used by SAIPE at the county level, which are fewer than the age groups used 
by SAIPE at the state level. As with state estimates, imputation increases 
the variance. However, the effect of imputation is much smaller for county 

Table 4
County variance estimates for Delaware

County  Parameter                  Method      Est. SE  Std. SE
001     θ1 (total poor)            Naive           409      100
                                   Imputation      444      109
        θ3 (5-17 related poor)     Naive           183      100
                                   Imputation      203      111
        θ4 (0-17 poor)             Naive           219      100
                                   Imputation      241      110
        Median HH income           Naive           323      100
                                   Imputation      336      104
003     θ1 (total poor)            Naive           687      100
                                   Imputation      838      122
        θ3 (5-17 related poor)     Naive           317      100
                                   Imputation      351      111
        θ4 (0-17 poor)             Naive           365      100
                                   Imputation      417      114
        Median HH income           Naive           200      100
                                   Imputation      226      113
005     θ1 (total poor)            Naive           518      100
                                   Imputation      608      117
        θ3 (5-17 related poor)     Naive           197      100
                                   Imputation      217      110
        θ4 (0-17 poor)             Naive           270      100
                                   Imputation      300      111
        Median HH income           Naive           361      100
                                   Imputation      389      108

Table 5
Donor distribution for wage income in Delaware (age > 15)

                  Number of donors   Number of donors   Number of donors
County            from county 1      from county 3      from county 5
1                 1271               1512               325
(n = 15,735)      (41%)              (49%)              (10%)
3                 1142               7374               1343
(n = 51,869)      (11%)              (75%)              (14%)
5                 847                1137               2045
(n = 19,661)      (21%)              (28%)              (51%)

estimates than for state estimates. County level estimation is an example 
of domain estimation, where the values used for imputation can come from 
donors outside the domain. Donors from outside the domain contribute less 
to the imputation variance of the domain total than donors in the domain 
because the imputed value from outside the domain is uncorrelated with 
the values observed in the domain. In effect, imputations from outside the 
domain increase the sample size on which the estimates are based, whereas 
imputations from inside the domain change the weights given to the obser- 
vations in the estimates. Because the proportions of outside donors differ 
across counties, the effect of imputation on county variances is not uniform 
across counties. In Delaware, the overall imputation rates for total income 
(the percent of records with at least one income item imputed) are 30.7%, 
29.5%, and 34.5% for county 1, county 3, and county 5, respectively. Table 5 
presents the distribution of donors for wage income in Delaware. In county 1, 
about 59% of the donors are from outside the county, whereas in county 3, 
only about 25% of the donors are from outside the county. Thus, the vari- 
ance inflation due to imputation, as reflected by the standardized standard 
error, is greater for county 3 than for county 1. 
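The outside-donor proportions quoted above can be recomputed from the counts in Table 5. The Python sketch below is illustrative only. It assumes that each row of Table 5 is the county receiving the imputed values and each column is the county supplying the donors, a reading that is consistent with the 59% and 25% figures in the text. The variable and function names are hypothetical.

```python
# Donor counts for wage income from Table 5:
# recipient county -> donor county -> number of donors.
donors = {
    1: {1: 1271, 3: 1512, 5: 325},
    3: {1: 1142, 3: 7374, 5: 1343},
    5: {1: 847, 3: 1137, 5: 2045},
}

def outside_share(recipient, counts):
    """Fraction of a county's donors that come from other counties."""
    total = sum(counts.values())
    return (total - counts[recipient]) / total

# Share of outside donors for each recipient county.
shares = {county: outside_share(county, counts) for county, counts in donors.items()}
```

Read alongside Table 4, the county with the smallest share of outside donors (county 3) also shows the largest standardized standard error for the total in poverty.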
SUPPLEMENTARY MATERIAL

Supplement A: Illustrated calculations (DOI: 10.1214/10-AOAS419SUPPA;
.pdf). We illustrate the construction of replicates for variance estimation
with a simple example where a simple random sample of original size six is
selected with two missing values and two donors per missing value.

Supplement B: Justification for (1) (DOI: 10.1214/10-AOAS419SUPPB;
.pdf). We provide a justification for (1) based on the large sample theory.
The assumptions and the proof for (1) are provided.

Supplement C: Proofs (DOI: 10.1214/10-AOAS419SUPPC; .pdf). Proofs
for equations (3), (5), and (6) are provided.